• Data locality challenges in large-scale data lakes
• Alluxio: An open-source caching framework
o Improve both performance and cost efficiency
o Distributed cache and embedded cache
• Uber: Using cache for exabyte-scale data lake
o HDFS DataNode local cache for high-density HDD adoption
o Presto local cache


The Evolution of the Modern Data Stack

10 years ago: Tightly-Coupled
MapReduce & HDFS, On-Prem HDFS

Today: Compute-Storage Separation
Cloud Data Lake, K8s/Containerization
More Elastic, Cheaper, Easier to Manage, More Scalable


We Are Losing Data Locality

Today: compute and storage are separated (Cloud Data Lake, K8s/Containerization).
Data locality is missing...


The Implications of Losing Data Locality

Slow & Expensive
• Slow and inconsistent data access performance
• Fast-growing cloud storage costs, including API and egress costs
• High data operation costs when migrating to the cloud

Complex Platform
• Data copies, synchronization costs
• Multiple APIs necessitate integration and application rewrites
• On-premise, cloud, hybrid, and multi-cloud environments all have different environment properties

Alluxio: A Critical Caching Framework in the Open-Source Data Stack

Interfaces: HDFS Interface, Java File API, POSIX Interface, S3 Interface
OPEN-SOURCE CACHING LAYER
Storage drivers: HDFS Driver, S3 Driver, GCS Driver, Azure Driver

Open-Source, Started From UC Berkeley AMPLab in 2014

• Top 10 Most Critical Java-Based Open Source Project
• GitHub's Top 100 Most Valuable Repositories Out of 96 Million
• 1,200+ contributors & growing
• 11,000+ Slack Community Members
• Join the conversation on Slack: alluxio.io/slack

Alluxio Technology Journey

• EXPLOSION OF DATA: rise of big data & analytics
• CLOUD ADOPTION: single to hybrid cloud, multi-cloud, cross region
• GENERATIVE AI: large-scale model training and deployment

A Caching Framework to Fit Different Needs

Alluxio Embedded/Local Cache
• Run as a library in the application processes (Presto, HDFS DataNode)
• Leverage local disk NVMe or memory
• When the size of hot data fits local disks

Alluxio Distributed Cache
• Standalone cache service shareable across applications
• Cache capacity scales horizontally

Multi-level Caching
L1 Embedded Cache + L2 Distributed Cache

[Figure labels: ParquetReader, ORCReader; 1MB Bytes Pages, Parquet MetaData; Alluxio Embedded, Alluxio Distributed, HDFS/]

• Battle-tested at Uber, Meta, TikTok, etc.
• Supports Iceberg, Hudi, Delta Lake, and Hive tables
• Supports varied file formats such as Parquet, ORC, and CSV
• Fully optimized for local NVMe storage


Alluxio Embedded Cache in Presto

[Figure: within the Presto Server JVM, the Alluxio Caching File System routes reads to the External File System (External Storage) on cache miss and to the Alluxio Cache Manager (Local cache storage) on cache hit]
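
The hit/miss routing in the figure can be illustrated with a small read-through cache. This is only a sketch of the general pattern, not Alluxio's implementation: the class `ReadThroughCache`, the `fake_external` reader, and the in-memory dictionary standing in for local cache storage are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy stand-in for a caching file system that wraps an external store."""

    def __init__(self, read_external: Callable[[str], bytes]) -> None:
        self._read_external = read_external  # hypothetical external reader
        self._local: Dict[str, bytes] = {}   # stands in for local cache storage
        self.hits = 0
        self.misses = 0

    def read(self, path: str) -> bytes:
        if path in self._local:
            # Cache hit: serve from local storage.
            self.hits += 1
            return self._local[path]
        # Cache miss: read from external storage, then populate the cache.
        self.misses += 1
        data = self._read_external(path)
        self._local[path] = data
        return data


external_calls = []


def fake_external(path: str) -> bytes:
    """Hypothetical external storage that records every request it serves."""
    external_calls.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fake_external)
first = cache.read("/warehouse/t/part-0")   # miss, goes to external storage
second = cache.read("/warehouse/t/part-0")  # hit, served locally
```

Only the first read reaches the external store; the second is served from the local copy, which is where the latency and API-cost savings come from.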


Alluxio Embedded Cache Data Management

Alluxio embedded cache provides cache eviction and admission
• Supports LRU and FIFO cache eviction policies
• Supports customized cache admission policy
• Supports TTL
• Supports data quota
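
To make the interplay of eviction, TTL, and quota concrete, here is a minimal sketch of an LRU cache with a byte quota and a per-entry TTL. It is an assumption-laden toy, not Alluxio's code: `LruTtlCache`, its parameters, and the explicit `now` argument (used instead of a real clock to keep the example deterministic) are all invented for illustration.

```python
from collections import OrderedDict
from typing import Optional


class LruTtlCache:
    """Toy cache: LRU eviction, a byte quota, and a time-to-live per entry."""

    def __init__(self, quota_bytes: int, ttl_seconds: float) -> None:
        self.quota_bytes = quota_bytes
        self.ttl_seconds = ttl_seconds
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, write time), LRU first

    def put(self, key: str, value: bytes, now: float) -> bool:
        if len(value) > self.quota_bytes:
            return False  # simple admission rule: never admit oversized values
        if key in self._entries:
            self._remove(key)
        # Evict least recently used entries until the new value fits the quota.
        while self.used_bytes + len(value) > self.quota_bytes:
            self._remove(next(iter(self._entries)))
        self._entries[key] = (value, now)
        self.used_bytes += len(value)
        return True

    def get(self, key: str, now: float) -> Optional[bytes]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, written_at = entry
        if now - written_at > self.ttl_seconds:
            self._remove(key)  # expired: drop and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def _remove(self, key: str) -> None:
        value, _ = self._entries.pop(key)
        self.used_bytes -= len(value)


demo = LruTtlCache(quota_bytes=10, ttl_seconds=60)
demo.put("a", b"aaaa", now=0)
demo.put("b", b"bbbb", now=1)
demo.get("a", now=2)            # touch "a" so "b" becomes least recently used
demo.put("c", b"cccc", now=3)   # exceeds quota, so "b" is evicted
```

Dropping the `move_to_end` call in `get` would turn the same structure into a FIFO policy, since entries would then be evicted purely in insertion order.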
[Figure: TPC benchmark charts, Cache vs. S3]

Alluxio Cluster Architecture (Distributed)

[Figure label: Registry]

Uber: Using Cache for Exabyte-scale Data Lake

Data informs every decision at Uber

Adopting High-Density HDD in HDFS

• Capacity per Host: 4TB * 24 → 16TB * 35
• Efficiency: >50% HW cost reduction
• Challenges
o DataNode IO performance
o HDFS reliability (blast radius)
• Opportunities
o Traffic focuses on a small group of extremely hot blocks
o Top 10K blocks attracted >90% read traffic

[Figure: number of reads and read traffic on blocks; values shown include 1,074,622, 633,923, 59376, 49317, 82.85%, 81.62%]

DataNode Local Cache

• Build a local cache within DataNode
o 4TB NVMe SSD disk
o Based on DataNode local traffic
• Leverage Alluxio for cache management
o Page-level cache
o 1MB default page size matches traffic pattern
• Exclude non-client reads

[Figure labels: HDFS Client, Cache Ratelimit, Cache Read, Non-Cache Read, Pull To Cache]
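
A page-level cache serves an arbitrary byte range by splitting it into fixed-size pages. The sketch below shows that arithmetic for the 1MB page size named above; the function name `pages_for_read` and its return shape are assumptions chosen for illustration, not an Alluxio or HDFS API.

```python
from typing import List, Tuple

PAGE_SIZE = 1 << 20  # 1 MB, the default page size named on the slide


def pages_for_read(offset: int, length: int,
                   page_size: int = PAGE_SIZE) -> List[Tuple[int, int, int]]:
    """Split a byte-range read into (page_index, offset_in_page, byte_count)."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    position = offset
    end = offset + length
    while position < end:
        page_index = position // page_size
        offset_in_page = position % page_size
        # Take bytes up to the end of this page or the end of the request.
        count = min(page_size - offset_in_page, end - position)
        pieces.append((page_index, offset_in_page, count))
        position += count
    return pieces


# A 30-byte read that starts 10 bytes before the first page boundary
# touches two pages.
straddling = pages_for_read(PAGE_SIZE - 10, 30)
```

Each tuple identifies one page lookup, so a read that straddles a boundary can hit on one page and miss on the other.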
Support Append to a Block

• HDFS blocks may not be immutable due to Append ops
• Leverage HDFS generation stamp to achieve "snapshot isolation"

[Figure: DataNode block and cache contents across three steps]
Step                                          DataNode         Cache
After Reading Block:1234 (Generation: 0001)   blk_1234_0001    blk_1234_0001
After Append Block:1234 (Generation: 0002)    blk_1234_0002    blk_1234_0001
After Reading Block:1234                      blk_1234_0002    blk_1234_0001, blk_1234_0002
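
The three steps in the figure follow from including the generation stamp in the cache key, so pages cached before an append can never be returned for the appended block. The sketch below assumes a hypothetical `BlockPageCache` keyed by block ID, generation stamp, and page index; none of these names come from HDFS or Alluxio.

```python
from typing import Dict, NamedTuple, Optional


class PageKey(NamedTuple):
    block_id: int
    generation_stamp: int
    page_index: int


class BlockPageCache:
    """Toy page cache whose keys include the block's generation stamp."""

    def __init__(self) -> None:
        self._pages: Dict[PageKey, bytes] = {}

    def put(self, block_id: int, generation_stamp: int,
            page_index: int, data: bytes) -> None:
        self._pages[PageKey(block_id, generation_stamp, page_index)] = data

    def get(self, block_id: int, generation_stamp: int,
            page_index: int) -> Optional[bytes]:
        return self._pages.get(PageKey(block_id, generation_stamp, page_index))

    def drop_older_generations(self, block_id: int, current: int) -> int:
        """Remove pages of a block cached under an older generation stamp."""
        stale = [key for key in self._pages
                 if key.block_id == block_id and key.generation_stamp < current]
        for key in stale:
            del self._pages[key]
        return len(stale)


pages = BlockPageCache()
pages.put(1234, 1, 0, b"old bytes")            # after reading generation 0001
miss_after_append = pages.get(1234, 2, 0)      # append bumped the stamp to 0002
pages.put(1234, 2, 0, b"old bytes + append")   # after reading generation 0002
```

Both generations coexist in the cache after the second read, matching the last column of the figure; stale pages can then be reclaimed lazily by eviction or by an explicit cleanup such as `drop_older_generations`.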
Rate Limiter

• 1:140 SSD-to-HDD ratio
• Need mechanisms to control what data can be loaded into cache
o Cache hit rate
o SSD write endurance
• Track block access frequency within a sliding time window

[Figure: per-minute buckets, from the (i - N)th minute through the (i - 1)th minute to the current bucket, the i-th minute]
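
One way to track access frequency over per-minute buckets is sketched below. The admission rule (admit a block once it has been read at least `min_accesses` times within the last `window_minutes` minutes) and the class name `SlidingWindowAdmission` are assumptions for illustration; the slides do not state the actual threshold or policy. The sketch also assumes timestamps arrive in non-decreasing order.

```python
from collections import Counter, deque


class SlidingWindowAdmission:
    """Counts block accesses in per-minute buckets over a sliding window."""

    def __init__(self, window_minutes: int, min_accesses: int) -> None:
        self.window_minutes = window_minutes
        self.min_accesses = min_accesses
        self._buckets = deque()  # (minute, Counter of block_id -> accesses)

    def record_and_check(self, block_id: int, now_seconds: float) -> bool:
        """Record one access; return True if the block may enter the cache."""
        minute = int(now_seconds // 60)
        # Open a new bucket when the current minute changes.
        if not self._buckets or self._buckets[-1][0] != minute:
            self._buckets.append((minute, Counter()))
        # Drop buckets that have slid out of the window.
        while self._buckets[0][0] <= minute - self.window_minutes:
            self._buckets.popleft()
        self._buckets[-1][1][block_id] += 1
        total = sum(counts[block_id] for _, counts in self._buckets)
        return total >= self.min_accesses


admission = SlidingWindowAdmission(window_minutes=2, min_accesses=3)
decisions = [
    admission.record_and_check(7, now_seconds=0),    # 1 access in window
    admission.record_and_check(7, now_seconds=70),   # 2 accesses in window
    admission.record_and_check(7, now_seconds=80),   # 3 accesses: admit
    admission.record_and_check(7, now_seconds=200),  # old buckets expired
]
```

Requiring repeated accesses before admission keeps one-off reads from being written to the SSD, which addresses both concerns listed above: hit rate and write endurance.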
Evaluation and Rollout

• Performance evaluation: DataNode trace replay
o Test with scale
o Reproduce production traffic profile
• Alluxio cache "shadow mode"
o A simulated cache mode to collect

[Figure labels: DataNode traces, Measurement, Test HDFS cluster results]
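
The idea of replaying a trace against a simulated cache can be sketched as follows. This is not Alluxio's shadow mode; it is a hypothetical `simulate_hit_ratio` helper that tracks keys only (no data bytes) under an assumed LRU policy, to show how a hit ratio could be estimated for a candidate cache size before deploying a real cache.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def simulate_hit_ratio(trace: Iterable[Hashable], capacity: int) -> float:
    """Replay an access trace against a key-only LRU cache of given capacity."""
    shadow = OrderedDict()  # keys only; no data is stored or copied
    hits = 0
    total = 0
    for key in trace:
        total += 1
        if key in shadow:
            hits += 1
            shadow.move_to_end(key)  # refresh recency on a hit
        else:
            shadow[key] = None
            if len(shadow) > capacity:
                shadow.popitem(last=False)  # evict the least recently used key
    return hits / total if total else 0.0


trace = ["a", "b", "a", "c", "a", "b"]
ratio_small = simulate_hit_ratio(trace, capacity=2)
ratio_large = simulate_hit_ratio(trace, capacity=3)
```

Running the same trace at several capacities gives a rough hit-ratio curve, which helps size the cache before any SSD is committed.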
Current Status

• Deployed in all major production clusters (1200 DataNodes)
• Effectively reduced the # of processes blocked by IO
• Cached data %: CacheRead/AllRead
• Offload 60% of IO from HDD
• Cache read throughput is significantly higher, nearly twice that of non-cache read throughput

[Figure labels: 3 Nodes, Queries/day, HDFS bytes read/day]

Workloads
• Interactive: ad hoc queries, served by the Interactive Presto Cluster
• Batch: scheduled, served by the Batch Presto Cluster

Presto Local Disk Cache

• Leverage Presto workers' local NVMe SSD disks
• Selective caching based on table
• Leverage Alluxio for cache management in Presto worker
• Forced cache TTL to meet compliance requirements

[Figure labels: Presto Worker JVM, Presto Worker]

Key Challenges @ Uber and Solutions

• Realtime Partition Updates
o Problem: Tables/partitions are constantly changing, so naive caching could return stale data.
o Solution: Integrated HDFS file modification time as part of the cache key.
• Cluster Membership Change
o Problem: Presto nodes frequently leave and join the cluster, causing file reads to be routed to the wrong node and thus cache misses.
o Solution: Introduced consistent hashing to tie files to certain Presto worker nodes.
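
The effect of consistent hashing on membership changes can be shown with a small hash ring. This is a generic textbook sketch, not Presto's node-selection code: `HashRing`, the MD5-based `_hash`, the worker names, and the choice of 100 virtual nodes per worker are all assumptions for illustration.

```python
import bisect
import hashlib
from typing import Iterable


def _hash(value: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, nodes: Iterable[str], virtual_nodes: int = 100) -> None:
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(virtual_nodes)
        )
        if not self._ring:
            raise ValueError("at least one node is required")
        self._points = [point for point, _ in self._ring]

    def node_for(self, path: str) -> str:
        # The first ring point clockwise from the file's hash owns the file.
        index = bisect.bisect_right(self._points, _hash(path)) % len(self._ring)
        return self._ring[index][1]


files = [f"/warehouse/t/part-{i}" for i in range(200)]
ring_before = HashRing(["worker-1", "worker-2", "worker-3"])
ring_after = HashRing(["worker-1", "worker-2"])  # worker-3 left the cluster
before = {f: ring_before.node_for(f) for f in files}
after = {f: ring_after.node_for(f) for f in files}
moved = [f for f in files if before[f] != after[f]]
```

Only files that lived on the departed worker are reassigned; files on the remaining workers keep their owner, so their cached data stays useful.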


Key Challenges @ Uber and Solutions


• Cache Size Restriction
o Problem: Available cache space is limited compared to the total data that needs to be scanned.
o Solution: Introduced a cache filter to selectively cache only certain hot data (top accessed tables).
• Tiny Reads
o Problem: Uber's Presto traffic often involves a series of tiny consecutive reads, leading to degraded behavior in Alluxio.
o Solution: Introduced buffered reads in Alluxio caching.
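
Why buffering helps with tiny consecutive reads can be seen in the sketch below: one larger underlying read is issued and subsequent small reads are served from memory. `BufferedReader`, its `read_at` callback, and the 128-byte buffer in the demo are hypothetical and chosen only for illustration; the slides do not describe the actual buffer size or implementation.

```python
from typing import Callable


class BufferedReader:
    """Serves small reads from one larger buffered read of the underlying source."""

    def __init__(self, read_at: Callable[[int, int], bytes],
                 buffer_size: int = 64 * 1024) -> None:
        self._read_at = read_at        # hypothetical positional read callback
        self._buffer_size = buffer_size
        self._buf_start = 0
        self._buf = b""

    def read(self, offset: int, length: int) -> bytes:
        buf_end = self._buf_start + len(self._buf)
        covered = self._buf_start <= offset and offset + length <= buf_end
        if not covered:
            # Refill: fetch at least buffer_size bytes starting at this offset.
            self._buf = self._read_at(offset, max(length, self._buffer_size))
            self._buf_start = offset
        start = offset - self._buf_start
        return self._buf[start:start + length]


data = bytes(range(256)) * 4
underlying_calls = []


def read_at(offset: int, length: int) -> bytes:
    """Fake underlying source that records each positional read."""
    underlying_calls.append((offset, length))
    return data[offset:offset + length]


reader = BufferedReader(read_at, buffer_size=128)
chunks = [reader.read(i * 8, 8) for i in range(16)]  # sixteen 8-byte reads
```

Sixteen tiny reads collapse into a single underlying call, trading a little memory for far fewer cache or storage operations.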

Presto Local SSD Cache On-Prem - Today

• Deployment
o Deployed to Presto production since 2022
o Clusters run with local cache
■ 5 clusters (out of 11 batch clusters)
■ 1500 nodes (43% of all batch nodes in primary region)
• Improvement
o ~13% Presto batch HDFS read offloaded to cache
o Input read latency reduced by ~44% compared to HDFS reads

Presto Local SSD Cache On Cloud - Initial Evaluation

• Deployed to Presto on cloud for initial evaluation
• Cost reduction AND performance improvement
o >80% reduction of # of read requests to GCS
o 228 s → 50 s reduction of P90 query latency

[Figure label: Without Cache]